基于空间结构的符号数据仿射传播算法<sup>*</sup>

doi:10.16451/j.cnki.issn1003-6059.201612009

摘要
图/表
参考文献
相关文章 (15)

全文: PDF (392 KB) HTML (1 KB)
输出: BibTeX | EndNote (RIS)

摘要由于符号型数据缺乏清晰的空间结构,很难构造一种合理的相似性度量,从而使诸多数值型聚类算法难以推广至符号型数据聚类.基于此种情况,文中引入一种空间结构表示方法,把符号型数据转化为数值型数据,能够在保持原符号型数据的结构特征的基础上重新构造样本之间的相似度.基于此方法,将仿射传播(AP)聚类算法迁移至符号数据聚类中,提出基于空间结构的符号数据AP算法(SBAP).在UCI数据集中若干符号型数据集上的实验表明,SBAP可以使AP算法有效处理符号型数据聚类问题,并且可以提升算法性能.

	服务

	把本文推荐给朋友
	加入我的书架
	加入引用管理器
	E-mail Alert
	RSS
	作者相关文章
	王齐
	钱宇华
	李飞江

关键词 ：聚类, 符号型数据, 仿射传播(AP), 空间结构, 相似度

Abstract：Constructing a reasonable similarity measure is difficult due to the lack of clear space structure in categorical data. Therefore, numerical clustering algorithms can hardly be extended to categorical data clustering. In this paper, a representation method for transforming the categorical data into numerical data is introduced. The similarity between samples is reconstructured and the structure feature of the original categorical data is maintained in the reconstruction process. Based on the data representation method, the affinity propagation(AP) clustering algorithm is migrated to the categorical data clustering. A space structure based AP algorithm for categorical data(SBAP) is proposed. Experimental results on several categorical datasets from the UCI dataset show that the proposed method makes AP algorithm deal with the categorical data clustering problem effectively with a significant improvement in performance.

Key words： Clustering Categorical Data Affinity Propagation (AP) Space Structure Similarity

收稿日期: 2016-05-13

ZTFLH:

TP 181

基金资助:国家自然科学基金项目(No.61432011,U1435212,61322211)、教育部新世纪优秀人才支持计划(No.NCET-12-1031)、高等学校博士学科点专项科研基金(博导类)(No.20121401110013)、山西省高等学校优秀青年学术带头人项目(No.20120301)资助

作者简介: 王齐,男,1979年生,博士,讲师,主要研究方向为数据挖掘、知识发现.E-mail:wqking@sxu.edu.cn.钱宇华(通讯作者),男,1976年生,博士,教授,主要研究方向为人工智能、数据挖掘、机器学习.E-mail:jinchengqyh@sxu.edu.cn. 李飞江,男,1990年生,博士研究生,主要研究方向为数据挖掘、知识发现.E-mail:lfjfly@163.com.

引用本文:

王齐，钱宇华，李飞江. 基于空间结构的符号数据仿射传播算法^*[J]. 模式识别与人工智能, 2016, 29(12): 1132-1139. WANG Qi, QIAN Yuhua, LI Feijiang. Space Structure Based Affinity Propagation Algorithm for Categorical Data. , 2016, 29(12): 1132-1139.

链接本文:

http://manu46.magtech.com.cn/Jweb_prai/CN/10.16451/j.cnki.issn1003-6059.201612009 或 http://manu46.magtech.com.cn/Jweb_prai/CN/Y2016/V29/I12/1132

[1] CHMIELEWSKI M R, GRZYMALA-BUSSE J W. Global Discretization of Continuous Attributes as Preprocessing for Machine Learning. International Journal of Approximate Reasoning, 1996, 15(4): 319-331.
[2] DASH M, LIU H. Consistency-Based Search in Feature Selection. Artificial Intelligence, 2003, 151(1/2): 155-176.
[3] KOHAVI R, JOHN G H. Wrappers for Feature Subset Selection. Artificial Intelligence, 1997, 97(1/2): 273-324.
[4] ZHOU Z H. Three Perspectives of Data Mining. Artificial Intelligence, 2003, 143(1): 139-146.
[5] MACQUEEN J. Some Methods for Classification and Analysis of Multivariate Observations // Proc of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, USA: University of California Press, 1967: 281-297.
[6] MITCHELL T M. Machine Learning. New York, USA: McGraw-Hill, 1997.
[7] CHEN X Q, PENG H, HU J S. K-Medoids Substitution Clustering Method and a New Clustering Validity Index Method // Proc of the 6th World Congress on Intelligent Control and Automation. New York, USA: IEEE, 2006: 5896-5900.
[8] GUHA S, RASTOGI R, SHIM K. CURE: An Efficient Clustering Algorithm for Large Databases. Information Systems, 2001, 26(1): 35-58.
[9] GUHA S, RASTOGI R, SHIM K. ROCK: A Robust Clustering Algorithm for Categorical Attributes // Proc of the 15th International Conference on Data Engineering. Washington, USA: IEEE, 1999: 512-521.
[10] ZHANG T, RAMAKRISHNAN R, LIVNY M. BIRCH: An Efficient Data Clustering Method for Very Large Databases // Proc of the ACM SIGMOD International Conference on Management of Data. Menlo Park, USA: AAAI Press, 1996: 103-114.
[11] ESTER M, KRIEGEL H P, SANDER J, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise // Proc of the 2nd International Conference on Knowledge Discovery and Data Mining. Menlo Park, USA: AAAI Press, 1996: 226-231.
[12] ANKERST M, BREUNIG M M, KRIEGEL H P, et al. OPTICS: Ordering Points to Identify the Clustering Structure // Proc of the ACM SIGMOD International Conference on Management of Data. Menlo Park, USA: AAAI Press, 1999: 49-60.
[13] HINNEBURG A, KEIM D A. An Efficient Approach to Clustering in Large Multimedia Databases with Noise // Proc of the 4th International Conference on Knowledge Discovery and Data Mining. Menlo Park, USA: AAAI Press, 1998: 58-65.
[14] WANG W, YANG J, MUNTZ R. STING: A Statistical Information Grid Approach to Spatial Data Mining[C/OL]. [2016-05-21]. http://www.vldb.org/conf/1997/P186.PDF.
[15] PARSONS L, HAQUE E, LIU H. Subspace Clustering for High Dimensional Data: A Review. ACM SIGKDD Explorations Newsletter, 2004, 6(1): 90-105.
[16] RALAM B H. A Conceptual Version of the k-means Algorithm. Pattern Recognition Letters, 1995, 16(11): 1147-1157.
[17] HUANG Z X. Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Know-ledge Discovery, 1998, 2(3): 283-304.
[18] CHATURVEDI A, GREEN P E, CAROLL J D. k-modes Clus-tering. Journal of Classification, 2001, 18(1): 35-55.
[19] HUANG Z X, NG M K. A Fuzzy k-modes Algorithm for Clustering Categorical Data. IEEE Trans on Fuzzy Systems, 1999, 7(4): 446-452.
[20] GAN G, WU J, YANG Z. A Genetic Fuzzy k-modes Algorithm for Clustering Categorical Data. Expert Systems with Applications, 2008, 36(2): 1615-1620.
[21] GANTI V, GEHRKE J, RAMAKRISHNAN R. CACTUS-Clus-tering Categorical Data Using Summaries // Proc of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, USA: ACM, 1999: 73-83.
[22] BARBAR D, LI Y, COUTO J. COOLCAT: An Entropy-Based Algorithm for Categorical Clustering // Proc of the 11th International Conference on Information and Knowledge Management. New York, USA: ACM, 2002: 582-589.
[23] FISHER D H. Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning, 1987, 2(2): 139-172.
[24] FREY B J, DUECK D. Clustering by Passing Messages between Data Points. Science, 2007, 315(5814): 972-976.
[25] QIAN Y, LI F, LIANG J, et al. Space Structure and Clustering of Categorical Data. IEEE Trans on Neural Networks and Learning Systems, 2015. DOI: 10.1109/TNNLS.2015.2451151.
[26] CHAN E Y, CHING W K, NG M K, et al. An Optimization Algorithm for Clustering Using Weighted Dissimilarity Measures. Pattern Recognition, 2004, 37(5): 943-952.
[27] BAI L, LIANG J, DANG C, et al. The Impact of Cluster Representatives on the Convergence of the k-modes Type Clustering. IEEE Trans on Pattern Analysis and Machine Intelligence, 2013, 35(6): 1509-1522.